attack policy
- North America > United States > Virginia (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Enhancing Robustness of Graph Neural Networks on Social Media with Explainable Inverse Reinforcement Learning
Adversarial attacks against graph neural networks (GNNs) through perturbations of the graph structure are increasingly common in social network tasks like rumor detection. Social media platforms capture diverse attack sequence samples through both machine and manual screening processes. Investigating effective ways to leverage these adversarial samples to enhance robustness is imperative. We improve the maximum entropy inverse reinforcement learning (IRL) method with the mixture-of-experts approach to address multi-source graph adversarial attacks.
Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI
Sandoval, Aaron, Rushing, Cody
The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor's comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Arnav, Benjamin, Bernabeu-Pérez, Pablo, Helm-Burger, Nathan, Kostolansky, Tim, Whittingham, Hannes, Phuong, Mary
As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores model reasoning and actions, and combines them using a weighted average. Our hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates twice higher than action-only monitoring for subtle deception scenarios.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.73)
Keep on Going: Learning Robust Humanoid Motion Skills via Selective Adversarial Training
Zhang, Yang, Cao, Zhanxiang, Nie, Buqing, Li, Haoyang, Jiangwei, Zhong, Sun, Qiao, Hu, Xiaoyi, Yang, Xiaokang, Gao, Yue
Humanoid robots are expected to operate reliably over long horizons while executing versatile whole-body skills. Yet Reinforcement Learning (RL) motion policies typically lose stability under prolonged operation, sensor/actuator noise, and real world disturbances. In this work, we propose a Selective Adversarial Attack for Robust Training (SA2RT) to enhance the robustness of motion skills. The adversary is learned to identify and sparsely perturb the most vulnerable states and actions under an attack-budget constraint, thereby exposing true weakness without inducing conservative overfitting. The resulting non-zero sum, alternating optimization continually strengthens the motion policy against the strongest discovered attacks. We validate our approach on the Unitree G1 humanoid robot across perceptive locomotion and whole-body control tasks. Experimental results show that adversarially trained policies improve the terrain traversal success rate by 40%, reduce the trajectory tracking error by 32%, and maintain long horizon mobility and tracking performance. Together, these results demonstrate that selective adversarial attacks are an effective driver for learning robust, long horizon humanoid motion skills.
Optimizing AI Agent Attacks With Synthetic Data
Loughridge, Chloe, Colognese, Paul, Griffin, Avery, Tracy, Tyler, Kutasov, Jon, Benton, Joe
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.
- North America > United States > Virginia (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
Enhancing Robustness of Graph Neural Networks on Social Media with Explainable Inverse Reinforcement Learning
Adversarial attacks against graph neural networks (GNNs) through perturbations of the graph structure are increasingly common in social network tasks like rumor detection. Social media platforms capture diverse attack sequence samples through both machine and manual screening processes. Investigating effective ways to leverage these adversarial samples to enhance robustness is imperative. We improve the maximum entropy inverse reinforcement learning (IRL) method with the mixture-of-experts approach to address multi-source graph adversarial attacks. We develop precise sample guidance and a bidirectional update mechanism to reduce the deviation caused by imprecise feature representation and negative sampling within the large action space of social graphs, while also accelerating policy learning.
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent
Ning, Liang-bo, Wang, Shijie, Fan, Wenqi, Li, Qing, Xu, Xin, Chen, Hao, Huang, Feiran
Recently, Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience and have attracted considerable attention. Despite the impressive progress, the research question regarding the safety vulnerability of LLM-empowered RecSys still remains largely under-investigated. Given the security and privacy concerns, it is more practical to focus on attacking the black-box RecSys, where attackers can only observe the system's inputs and outputs. However, traditional attack approaches employing reinforcement learning (RL) agents are not effective for attacking LLM-empowered RecSys due to the limited capabilities in processing complex textual inputs, planning, and reasoning. On the other hand, LLMs provide unprecedented opportunities to serve as attack agents to attack RecSys because of their impressive capability in simulating human-like decision-making processes. Therefore, in this paper, we propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs, where an LLM-based agent is developed to attack LLM-Empowered RecSys. Specifically, our method first identifies the insertion position for maximum impact with minimal input modification. After that, the LLM agent is designed to generate adversarial perturbations to insert at target positions. To further improve the quality of generated perturbations, we utilize the prompt tuning technique to improve attacking strategies via feedback from the victim RecSys iteratively. Extensive experiments across three real-world datasets demonstrate the effectiveness of our proposed attacking method.
Ctrl-Z: Controlling AI Agents via Resampling
Bhatt, Aryan, Rushing, Cody, Kaufman, Adam, Tracy, Tyler, Georgiev, Vasil, Matolcsi, David, Khan, Akbir, Shlegeris, Buck
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.82)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)